Upgrade to Python 3.14, enable comprehensive ruff linting, and harden CI#5
Conversation
Bumps the project to Python 3.14 across devcontainer, CI, mypy, ruff, and pyrefly configs, and upgrades the full dependency tree (langchain 1.x, gradio 6.x, pydantic 2.13, etc.) to versions with 3.14 wheels. - Replace pre-commit with prek (faster, drop-in compatible) as the git hook runner; bump the ruff/pyupgrade hook revs and pin default_language_version to python3.14 so hook environments match the project interpreter. - gradio 6 dropped Chatbot(type="messages") now that the messages format is the only one; remove the obsolete kwarg. - langchain 1.x removed the langchain.schema compat shim; switch to langchain_core.documents.Document. - langchain-huggingface now gates sentence-transformers/transformers behind a "full" extra; request it explicitly (including in the mypy hook's additional_dependencies) so HuggingFaceEmbeddings keeps working and type-checks cleanly. - click 8.2+ no longer mixes stderr into CliRunner stdout by default; update CLI exception-message assertions to check result.output. - Apply ruff's py314-target modernizations (PEP 604/758 syntax, simplified Generator[] hints). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Expands ruff's lint selection from a minimal E/F/I/UP set to ~50 rule categories (bugbear, bandit, pathlib, pydocstyle, simplify, pylint, tryceratops, type-checking, and more), with every ignore -- global and per-file -- carrying an inline comment explaining why. Highlights beyond mechanical cleanup: - Logging: all f-string log calls converted to lazy %-formatting; bare root-logger calls replaced with named loggers; .error() -> .exception() in except blocks so tracebacks are actually captured. - Migrated ~176 os.path call sites to pathlib.Path across the crawler, parser, vectorizer, and config modules; updated tests to mock Path methods instead of os.* accordingly. - Reverted an initial attempt to move imports behind TYPE_CHECKING (TC001-003): it caused a real NameError because gradio's event-handler wiring eagerly resolves type hints, and this project has no `from __future__ import annotations` to make deferred imports safe. Documented and ignored instead. - Removed ChromaStore's reliance on langchain's private `_collection` attribute, rewriting query_with_embedding/get_document against the public Chroma API. - Implemented multi-turn chat history support in LLMService/ RAGOrchestrator: the `history` parameter was flagged as unused (ARG002) but was actually wired in from the Gradio app and simply never consumed. Threads it into the ollama messages list (capped via MAX_HISTORY_MESSAGES) and fixes a double-counting bug where respond_stream included the just-appended current message in history. - Reduced real complexity (C901/PLR0911/PLR0915) in check_model_availability, BaseCrawler._crawl_url, and the `parse` CLI command via extraction, rather than raising thresholds. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous CI lint steps used `ruff check --fix` and `ruff format`, which mutate files in place instead of failing the build -- a PR with unformatted or lint-violating code would pass CI silently. Since not every contributor runs the local prek hook, replace those steps with `prek run --all-files`, which fails on any finding (including the yaml/json/toml checks, pyupgrade, and trailing-comma hooks that weren't being run in CI at all). mypy and pyrefly stay as separate steps against the project's own already-synced venv, rather than have prek spin up a second, duplicate dependency set just for those two hooks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributors unfamiliar with GitHub Actions shouldn't need to open raw step logs to see what failed. Adds three native GitHub Actions features for in-PR visibility: - Ruff findings now also run with --output-format=github, producing inline annotations on the "Files changed" tab. - Every check (prek, mypy, pyrefly, pytest, coverage) now runs with continue-on-error so a single push surfaces all problems at once, instead of stopping at the first failure and requiring a fix/push/fail cycle to discover the next one. - A job summary step renders a pass/fail table for all checks plus the coverage percentage on the workflow run's Summary tab. - A final gating step explicitly fails the job if any check's outcome wasn't "success", preserving correct CI enforcement despite the continue-on-error used to let every check run independently. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 36 minutes and 42 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (8)
📝 WalkthroughWalkthroughThe PR updates project settings to Python 3.14, refactors file I/O to ChangesPython 3.14 upgrade and runtime refactors
Sequence Diagram(s)sequenceDiagram
participant TapioAssistantApp
participant RAGOrchestrator
participant LLMService
participant Ollama
TapioAssistantApp->>RAGOrchestrator: query_stream(query_text, history=prior_turns)
RAGOrchestrator->>LLMService: generate_response_stream(prompt, system_prompt, history=prior_turns)
LLMService->>Ollama: chat(messages from prompt, system_prompt, history)
Ollama-->>LLMService: streamed chunks
LLMService-->>RAGOrchestrator: streamed chunks
RAGOrchestrator-->>TapioAssistantApp: streamed response
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Direct equality comparisons of floats are fragile in general, even though this particular value passes through a mock unchanged. Use the idiomatic pytest.approx instead.
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (3)
tests/vectorstore/test_chroma_store.py (1)
202-204: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueSonarCloud float-equality hint here is effectively a false positive.
scoreis the literal0.1returned by the mock and flows through unchanged (no arithmetic), so the exact comparison is deterministic. Optionally switch topytest.approx(0.1)to silence the rule.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/vectorstore/test_chroma_store.py` around lines 202 - 204, The Chroma store test is using a direct float equality check on score, which SonarCloud flags even though the mock returns the literal value unchanged. Update the assertion in the test around the results[0] unpacking to use a float-tolerant comparison such as pytest.approx for score, while keeping the citation_url assertion unchanged.Source: Linters/SAST tools
.vscode/extensions.json (1)
2-7: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winRecommend Ruff here too.
.devcontainer/devcontainer.jsonalready includescharliermarsh.ruff, but the new workspace recommendations omit it. That leaves non-container contributors without the editor extension this PR standardizes on for lint/format workflows.Suggested change
"recommendations": [ "anthropic.claude-code", "ms-vscode-remote.remote-containers", "github.vscode-pull-request-github", - "ms-python.python" + "ms-python.python", + "charliermarsh.ruff" ]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.vscode/extensions.json around lines 2 - 7, The workspace recommendations in extensions.json are missing the Ruff extension that this PR standardizes on. Update the recommendations list in the extensions.json configuration to include charliermarsh.ruff alongside the existing entries so non-container users get the same lint/format support as the devcontainer setup..pre-commit-config.yaml (1)
16-19: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUpdate pyupgrade to target Python 3.14
.pre-commit-config.yaml:19still uses--py313-plus; switch it to--py314-plusso pre-commit and CI apply 3.14-specific rewrites.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.pre-commit-config.yaml around lines 16 - 19, The pyupgrade pre-commit hook is still targeting Python 3.13, so update the hook configuration in .pre-commit-config.yaml to use the pyupgrade args for Python 3.14 instead. Locate the pyupgrade hook entry and change the existing --py313-plus setting to the 3.14 equivalent so pre-commit and CI apply the correct rewrites.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/ci.yaml:
- Around line 34-38: The workflow uses floating action tags, so update the
third-party action references in the CI workflow to full commit SHAs instead of
version tags. Replace the existing uses entries such as actions/cache@v4 with
pinned SHAs, and apply the same change to the other external actions in this
workflow so the job definitions remain stable and reproducible.
In `@pyproject.toml`:
- Around line 110-124: The Ruff per-file ignores for tests are only applied to
files directly under tests, so nested test modules still get linted. Update the
[tool.ruff.lint.per-file-ignores] pattern from tests/* to a recursive
tests/**/*.py pattern so all test files under tests, including subdirectories,
inherit the same ignores.
In `@README.md`:
- Line 125: The README sentence in the all-contributors section is grammatically
awkward and should be rewritten. Update the text to use a natural phrasing like
“Contributions of any kind are welcome!” and keep the surrounding
all-contributors description unchanged.
In `@tapio/crawler/crawler.py`:
- Around line 331-338: The containment check in crawler.py should compare path
components rather than string prefixes, since the current resolve() plus
startswith() logic in the path validation block can be bypassed by paths sharing
the same text prefix. Update the check around abs_full_path and abs_output_dir
to use abs_full_path.is_relative_to(abs_output_dir) on Python 3.9+, with a
relative_to() fallback if needed, and keep the existing ValueError path for
rejected URLs.
In `@tapio/parser/parser.py`:
- Around line 558-560: The path normalization in the nested-path branch of
parser.py is removing every ".html" substring instead of only the file
extension. Update the logic around the rel_path handling in the nested path
return so it strips only a trailing ".html" suffix, preserving any mid-path
occurrences, and keep the behavior localized to that nested-path branch.
In `@tapio/services/document_retrieval_service.py`:
- Line 56: The INFO log in DocumentRetrievalService.retrieve_documents is
persisting the raw user-supplied query_text, which should be avoided. Update the
logger.info call to stop printing the full query and instead log a request
identifier plus non-sensitive metadata such as query length or a redacted/hash
value. Keep the change localized to the retrieval logging in
DocumentRetrievalService so the request can still be traced without exposing
PII.
In `@tapio/utils/markdown_utils.py`:
- Line 71: The Python 2-style exception clause is invalid under Python 3 and
prevents module import; update the affected except blocks in markdown_utils and
the matching clause in utils/__init__.py to use a proper multi-exception form.
Locate the bad syntax in the exception-handling logic around the markdown
utility parsing path and any shared import-time helpers, then replace the
comma-separated exception list with a tuple so the modules remain importable.
In `@tapio/utils/text_utils.py`:
- Around line 75-80: In chunk_html_content, validate that chunk_overlap is
strictly smaller than chunk_size before computing estimated_chunks or any
fallback step size. Add an early fail-fast check near the top of the function
(before the chunk estimation logic) that raises ValueError for chunk_overlap >=
chunk_size, so the division and later chunking paths never see a zero or
negative step.
---
Nitpick comments:
In @.pre-commit-config.yaml:
- Around line 16-19: The pyupgrade pre-commit hook is still targeting Python
3.13, so update the hook configuration in .pre-commit-config.yaml to use the
pyupgrade args for Python 3.14 instead. Locate the pyupgrade hook entry and
change the existing --py313-plus setting to the 3.14 equivalent so pre-commit
and CI apply the correct rewrites.
In @.vscode/extensions.json:
- Around line 2-7: The workspace recommendations in extensions.json are missing
the Ruff extension that this PR standardizes on. Update the recommendations list
in the extensions.json configuration to include charliermarsh.ruff alongside the
existing entries so non-container users get the same lint/format support as the
devcontainer setup.
In `@tests/vectorstore/test_chroma_store.py`:
- Around line 202-204: The Chroma store test is using a direct float equality
check on score, which SonarCloud flags even though the mock returns the literal
value unchanged. Update the assertion in the test around the results[0]
unpacking to use a float-tolerant comparison such as pytest.approx for score,
while keeping the citation_url assertion unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: b71b1f42-a291-433a-8a0a-d05ea94b788f
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (50)
.devcontainer/Dockerfile.devcontainer/devcontainer.json.github/workflows/ci.yaml.pre-commit-config.yaml.python-version.vscode/extensions.jsonREADME.mdmise.tomlmypy.inipyproject.tomlpyrefly.tomlpytest.initapio/app.pytapio/cli.pytapio/config/config_manager.pytapio/config/config_models.pytapio/crawler/__init__.pytapio/crawler/crawler.pytapio/crawler/runner.pytapio/factories.pytapio/models/__init__.pytapio/models/document.pytapio/parser/__init__.pytapio/parser/parser.pytapio/prompts/__init__.pytapio/prompts/prompt_loader.pytapio/services/__init__.pytapio/services/document_retrieval_service.pytapio/services/llm_service.pytapio/services/rag_orchestrator.pytapio/utils/__init__.pytapio/utils/embedding_utils.pytapio/utils/markdown_utils.pytapio/utils/text_utils.pytapio/vectorstore/chroma_store.pytapio/vectorstore/vectorizer.pytests/config/test_config_manager.pytests/crawler/test_crawler.pytests/integration/test_rag_pipeline.pytests/parser/test_parser.pytests/parser/test_relative_links.pytests/prompts/test_prompt_loader.pytests/services/test_document_retrieval_service.pytests/services/test_llm_service.pytests/test_cli.pytests/test_gradio_app.pytests/utils/test_markdown_utils.pytests/utils/test_text_utils.pytests/vectorstore/test_chroma_store.pytests/vectorstore/test_vectorizer.py
💤 Files with no reviewable changes (2)
- tests/parser/test_relative_links.py
- pytest.ini
| [tool.ruff.lint.per-file-ignores] | ||
| "tests/*" = [ | ||
| "ANN", # tests are excluded from mypy and don't need type annotations | ||
| "ARG001", # unused fixture/mock arguments are normal in tests | ||
| "ARG002", # unused mock-override arguments are normal in tests | ||
| "D", # docstrings aren't required on test functions | ||
| "INP001", # tests intentionally use implicit namespace packages | ||
| "PLC0415", # local imports for mocking/patching timing are idiomatic in tests | ||
| "PLR0913", # test functions naturally take many fixture arguments | ||
| "PLR2004", # magic values in test assertions/fixtures are fine | ||
| "PT009", # some test classes intentionally use unittest.TestCase-style assertions | ||
| "PTH", # os.path-based temp file/dir handling in test fixtures is fine | ||
| "S101", # assert is the standard pytest idiom | ||
| "SLF001", # tests intentionally exercise private members | ||
| ] |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🌐 Web query:
In Ruff, does a per-file-ignorespattern oftests/match nested files liketests/services/test_llm_service.py, or is tests/**/.py required?
💡 Result:
In Ruff, a per-file-ignores pattern of tests/* is generally insufficient for matching files in nested subdirectories because it is a non-recursive, single-level glob pattern [1]. While some users have reported behavior where single-level patterns occasionally match subdirectories depending on specific implementation details or environment conditions [2], the standard and recommended way to ensure a pattern applies to an entire directory tree (including all nested files) is to use the recursive wildcard `` [3][1]. To ensure files like tests/services/test_llm_service.py are matched, you should use: [tool.ruff.lint.per-file-ignores] "tests//.py" = ["<RULE_CODE>"] Using `tests//.py` explicitly enables recursive matching, ensuring that all Python files within the `tests/` directory and its subdirectories are covered [1]. For debugging or verifying whether a specific file is being matched as you expect, you can run `ruff check --show-settings <path/to/file.py>` to see how the patterns are resolved [4].
Citations:
- 1: https://openillumi.com/en/en-ruff-per-file-ignores-pyproject-toml/
- 2:
per-file-ignoressingle-level glob patterns match files in subdirectories astral-sh/ruff#21425 - 3: Negation of per-file-ignores not working as expected astral-sh/ruff#13716
- 4: Question: How can I ignore a rule in one directory? astral-sh/ruff#2457
🌐 Web query:
Ruff documentation per-file-ignores glob patterns recursive directory matching tests/* tests/**/*.py
💡 Result:
In Ruff, the per-file-ignores setting supports glob patterns to match files or directories for which specific lint rules should be disabled [1][2]. For recursive directory matching—where you want to apply ignores to all files within a directory and its subdirectories—the standard approach is to use the double wildcard pattern [2][3]. Recommended Patterns: 1. Recursive Matching (Recommended for most test directories): To ignore rules for all Python files throughout an entire directory tree, use the pattern "tests//.py" [2]. [tool.ruff.lint.per-file-ignores] "tests//.py" = ["D"] 2. Non-Recursive Matching: Using "tests/" targets files directly within the "tests/" directory [2]. Depending on your specific Ruff version and configuration, this may not consistently cover nested subdirectories [2][4]. Key Implementation Details: - Location: This setting is typically configured in your pyproject.toml or ruff.toml file under the [tool.ruff.lint.per-file-ignores] section [5][1][2]. - Syntax: The patterns are evaluated as glob patterns [1][2]. You can specify a list of rule codes or prefixes to ignore for the matched files [1]. - Negation: You can use an initial '!' to negate a file pattern if needed [1]. - Consistency: While some simpler glob patterns might behave differently across tool settings (like exclude vs. per-file-ignores), using the explicit recursive syntax (/.py) is the most reliable way to target all files in a directory hierarchy [2][6].
Citations:
- 1: https://docs.astral.sh/ruff/settings/
- 2: https://openillumi.com/en/en-ruff-per-file-ignores-pyproject-toml/
- 3: https://stackoverflow.com/questions/77680073/ignore-specific-rules-in-specific-directory-with-ruff
- 4:
per-file-ignoressingle-level glob patterns match files in subdirectories astral-sh/ruff#21425 - 5: https://docs.astral.sh/ruff/configuration/?featured_on=pythonbytes
- 6: File pattern syntax differs between exclude and per-file-ignores astral-sh/ruff#10147
🌐 Web query:
site:docs.astral.sh/ruff per-file-ignores glob pattern recursive nested subdirectories tests/*
💡 Result:
To apply per-file-ignores recursively to all files within a tests directory and its nested subdirectories, you should use the globstar pattern in your configuration. In your configuration file (e.g., pyproject.toml or ruff.toml), define the per-file-ignores setting using a pattern like the following [1][2][3]: [tool.ruff.lint.per-file-ignores] "tests/" = ["RULE_CODE"] The pattern ensures that the rule is applied recursively to all files within the tests directory and any of its nested subdirectories [4]. If you need to target specific file types within those directories, you can append the extension, for example: "tests//*.py" = ["RULE_CODE"] This configuration instructs the linter to exclude the specified rule codes for any matching file found at any depth under the tests directory [2].
Citations:
- 1: https://docs.astral.sh/ruff/configuration/
- 2: https://docs.astral.sh/ruff/settings/
- 3: https://docs.astral.sh/ruff/configuration/?featured_on=pythonbytes
- 4: https://docs.astral.sh/ruff/rules/glob/
🌐 Web query:
Ruff glob rules recursive pattern per-file-ignores tests/**/*.py tests/** documentation
💡 Result:
In Ruff, the per-file-ignores setting uses glob patterns to identify files for which specific lint rules should be disabled [1]. You can achieve recursive matching using the double-asterisk () wildcard [2]. For recursive matching in your examples, the following patterns are correct: - To target all Python files recursively within a `tests` directory: `"tests//*.py" = [...]` [2]. - To target an entire directory tree (such as `tests` or `documentation`), you can use `"tests/" = [...]` or `"documentation/" = [...]` [3][2]. Key details for configuring `per-file-ignores`: Glob Pattern Behavior: - Single-level vs. Recursive: Patterns like `tests/*` are non-recursive and match only files immediately within the `tests/` directory [2]. Patterns using (e.g., tests//*.py or tests/) are recursive and match the entire directory hierarchy [2]. - Brace Expansion: You can group multiple paths using brace expansion to avoid repetition, for example: "/{tests,documentation}/*" = [...] [3][4]. - Negation: An initial ! negates the pattern [1]. For instance, "!src/.py" = ["D"] ignores D rules everywhere except for the src/ directory [1]. Configuration: - You define these mappings in your pyproject.toml or ruff.toml file under [tool.ruff.lint.per-file-ignores] [1][4]. - Patterns are generally resolved relative to the configuration file [5]. Note on path matching: Ruff's globbing implementation for per-file-ignores may occasionally behave differently than standard exclude patterns regarding directory boundaries [6][7]. If you encounter unexpected matching, verify your settings by running ruff check --show-settings to inspect how the glob patterns are resolved [6][5].
Citations:
- 1: https://docs.astral.sh/ruff/settings/
- 2: https://openillumi.com/en/en-ruff-per-file-ignores-pyproject-toml/
- 3: Multiple paths for ignoring same rules in lint.per-file-ignores astral-sh/ruff#17166
- 4: https://docs.astral.sh/ruff/configuration/
- 5:
per-file-ignorespaths fromextendare relative to the extended configuration file astral-sh/ruff#8795 - 6: File pattern syntax differs between exclude and per-file-ignores astral-sh/ruff#10147
- 7:
per-file-ignoressingle-level glob patterns match files in subdirectories astral-sh/ruff#21425
🌐 Web query:
Ruff per-file-ignores glob syntax tests/** tests/ tests/**/*.py official docs
💡 Result:
Ruff's per-file-ignores setting supports glob patterns to match files or directories for which specific lint rules should be ignored [1][2][3]. These patterns are defined in your configuration file (e.g., pyproject.toml or ruff.toml) under the [tool.ruff.lint.per-file-ignores] section [1][4][5]. Key syntax points: - Recursive matching: Use to match files recursively through subdirectories [3][5]. For example, "tests//.py" matches all Python files within the tests directory and any of its subdirectories [5]. - Brace expansion: Ruff supports brace expansion, allowing you to group multiple paths or extensions [1][6]. For example, "/{tests,docs,tools}/" ignores violations in any of those directories [1][4]. - Negation: You can use a leading! to negate a pattern [2][3]. This is useful for "ignore everywhere except" scenarios, such as "!src/.py" = ["D"] to ignore Pydocstyle (D) rules in every file except those in the src directory [2][3]. - Path context: Patterns are relative to the project root (the directory containing your configuration file) [7]. - Configuration: You map these patterns to a list of rule codes or prefixes [2][5]. Example configuration (pyproject.toml): [tool.ruff.lint.per-file-ignores] # Ignore D (Pydocstyle) in all files within the tests directory recursively "tests//.py" = ["D"] # Ignore E402 in specific directories using brace expansion "/{tests,docs,tools}/" = ["E402"] # Ignore D everywhere except the src directory "!src/.py" = ["D"] Note that file pattern syntax in Ruff can occasionally behave differently depending on whether you are using per-file-ignores or the top-level exclude setting [8]. It is recommended to verify your configuration using ruff show-settings if you encounter unexpected matching behavior [8].
Citations:
- 1: https://docs.astral.sh/ruff/configuration/?featured_on=pythonbytes
- 2: https://docs.astral.sh/ruff/settings/
- 3: https://github.com/astral-sh/ruff/blob/main/crates/ruff_workspace/src/options.rs
- 4: https://github.com/astral-sh/ruff/blob/b16565ef/docs/configuration.md
- 5: https://openillumi.com/en/en-ruff-per-file-ignores-pyproject-toml/
- 6: Multiple paths for ignoring same rules in lint.per-file-ignores astral-sh/ruff#17166
- 7: https://github.com/astral-sh/ruff/blob/df88504dea5f15a10a5cdc765e5d0a80a6f85420/README.md
- 8: File pattern syntax differs between exclude and per-file-ignores astral-sh/ruff#10147
🌐 Web query:
site:docs.astral.sh/ruff per-file-ignores tests/* recursive glob official
💡 Result:
In the context of the Ruff linter, you can use recursive globbing within the per-file-ignores setting [1][2]. Ruff supports standard glob patterns for file matching in the per-file-ignores configuration [3][1]. To ignore rules recursively across all files within a specific directory, you can use the double asterisk globstar pattern [1][2]. For example, to ignore rule E402 for all files within a tests directory and its subdirectories, you can use the following configuration in your pyproject.toml: [tool.ruff.lint.per-file-ignores] "/tests/" = ["E402"] Alternatively, you can group multiple directories: [tool.ruff.lint.per-file-ignores] "/{tests,docs}/" = ["E402"] Note that while the original Flake8 tool also supports per-file-ignores, its pattern matching capabilities are more limited compared to Ruff, and it does not natively support recursive globstar patterns in the same way. If you are specifically looking for recursive support, Ruff is the recommended, modern alternative that provides this functionality [1][2].
Citations:
- 1: https://docs.astral.sh/ruff/configuration/?featured_on=pythonbytes
- 2: https://docs.astral.sh/ruff/configuration/
- 3: https://docs.astral.sh/ruff/settings/
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '%s\n' '--- pyproject.toml excerpt ---'
sed -n '110,124p' pyproject.toml
printf '\n%s\n' '--- tests tree (depth 3) ---'
find tests -maxdepth 3 -type f | sort | sed -n '1,200p'Repository: Finntegrate/tapio
Length of output: 1895
🌐 Web query:
site:docs.astral.sh/ruff glob patterns recursive double asterisk per-file-ignores official docs
💡 Result:
In glob patterns, the double asterisk is known as a globstar or recursive wildcard [1]. It matches any directory, subdirectory, or file recursively within the specified path [1]. When used in the per-file-ignores configuration of tools like the Ruff linter, the pattern allows you to apply rule exclusions to files across a directory tree [2][3]. For example, a pattern like "/{tests,docs,tools}/" will match and apply the specified ignore rules to all files nested within the tests, docs, or tools directories, regardless of their depth in the project structure [2][3]. Key points regarding the recursive double asterisk in this context: - Recursive Matching: Unlike a single asterisk (), which typically matches files only within a single directory level, enables deep, recursive traversal of the file system [1]. - Configuration Usage: It is commonly used in configuration files (such as pyproject.toml or ruff.toml) to selectively disable specific linting rules for large portions of a project, such as test suites or generated documentation, without needing to list every individual file or subdirectory [2][3]. - Paths: Patterns are generally treated as relative to the project root, meaning at the start of a pattern refers to the entire directory structure relative to that root [4][2].
Citations:
- 1: https://docs.astral.sh/ruff/rules/glob/
- 2: https://docs.astral.sh/ruff/configuration/
- 3: https://docs.astral.sh/ruff/configuration/?featured_on=pythonbytes
- 4: https://docs.astral.sh/ruff/settings/
🌐 Web query:
site:docs.astral.sh/ruff "per-file-ignores" "tests/*" Ruff
💡 Result:
In Ruff, you can use the lint.per-file-ignores setting to disable specific lint rules for files or directories that match a given pattern [1][2]. To ignore rules for files within the tests directory, you can add an entry to your pyproject.toml or ruff.toml file under the [tool.ruff.lint.per-file-ignores] section [1][2]. Example configuration: [tool.ruff.lint.per-file-ignores] "tests/" = ["E402", "F401"] In this example, the rules E402 and F401 will be ignored for all files contained within the tests directory [1][2]. Key details: - Pattern Matching: Ruff uses glob patterns [2]. "tests/" will match files directly inside the tests directory, while "/{tests}/" or "tests//" can be used to match files recursively within the tests directory and its subdirectories [1][2]. - Configuration: This setting must be placed under [tool.ruff.lint.per-file-ignores] if using pyproject.toml [1][2]. - CLI: While there is no dedicated CLI flag for every configuration, you can use the --per-file-ignores flag or the --config flag to override these settings from the command line [1].
Citations:
🌐 Web query:
Ruff glob matching does * match slash per-file-ignores official docs
💡 Result:
In Ruff, the behavior of glob matching—specifically whether the asterisk () character matches path separators (slashes)—can be complex and has been a subject of ongoing discussion and refinement [1][2]. Officially, Ruff's glob implementation uses the globset crate [1]. Historically, the default behavior of the globset library allows * to match path separators, which means a pattern like src/.py might inadvertently match files in subdirectories (e.g., src/subdir/file.py) in addition to files directly within the directory [1][2]. Key points regarding Ruff's glob matching in per-file-ignores: 1. Path Separator Matching: Because * matches path separators by default, patterns intended to be single-level often behave as multi-level, recursive matches [1][2]. Users have reported this as an issue where, for example, src/.py matches files deeper in the directory tree than expected [2]. 2. Limitations and Configuration: There is no simple toggle in the current user-facing configuration to disable this "match-slash" behavior [1]. While internal efforts to apply literal_separator(true) to glob patterns (which restricts * from matching slashes) have been discussed and implemented in some parts of the codebase, behavior may remain inconsistent across different settings like exclude and per-file-ignores [2][3]. 3. Recommended Practice: To avoid ambiguity or unexpected matches, use explicit recursive patterns like /.py if you intend to match all files in a tree, or target files more specifically [4][5]. If you need to match files only at a specific depth, the current implementation's tendency to treat * as matching slashes necessitates caution [2]. Because Ruff's internal glob handling is evolving, and behaviors can vary depending on the specific version and configuration context (e.g., CLI vs. file-based settings), it is best to verify your patterns using the specific Ruff version you are running if you require strict path-level matching [6][7][2].
Citations:
- 1: exclude pattern with
*in file name incorrectly matches entire file path astral-sh/ruff#6262 - 2:
per-file-ignoressingle-level glob patterns match files in subdirectories astral-sh/ruff#21425 - 3: https://github.com/astral-sh/ruff/blob/99d97bd7/crates/ty_project/src/glob/exclude.rs
- 4: https://docs.astral.sh/ruff/settings/
- 5: https://docs.astral.sh/ruff/configuration/?featured_on=pythonbytes
- 6: File pattern syntax differs between exclude and per-file-ignores astral-sh/ruff#10147
- 7: Extreme performance penalty when using
--extend-per-file-ignoresand leading recursive glob patterns astral-sh/ruff#8626
🌐 Web query:
site:docs.astral.sh/ruff/settings/ "per-file-ignores" "tests/*" "subdirectories"
💡 Result:
In tools like Ruff, the per-file-ignores setting supports recursive matching within subdirectories by using glob patterns [1]. To apply ignore rules to all files within a specific directory and its subdirectories, you can use the double-star glob pattern [1]. For example, to ignore specific rule codes for all files within a tests directory and any of its subdirectories, you would use the following configuration in your pyproject.toml or ruff.toml file [1]: [tool.ruff.lint.per-file-ignores] "tests//.py" = ["RULE1", "RULE2"] Using "tests/" typically matches files directly within the tests directory, whereas "tests//*.py" ensures that the pattern recurses into all nested subdirectories [1]. You can also use the! prefix to negate patterns or define complex inclusion/exclusion logic as needed [1].
Citations:
Make the Ruff test ignore recursive. tests/* only matches files directly under tests/; nested test files under tests/config/, tests/services/, etc. still need these ignores. Use tests/**/*.py here.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pyproject.toml` around lines 110 - 124, The Ruff per-file ignores for tests
are only applied to files directly under tests, so nested test modules still get
linted. Update the [tool.ruff.lint.per-file-ignores] pattern from tests/* to a
recursive tests/**/*.py pattern so all test files under tests, including
subdirectories, inherit the same ignores.
| markdown_files.append(file_path) | ||
| except Exception as e: | ||
| logger.error(f"Error finding markdown files: {e}") | ||
| except ValueError, IndexError: |
There was a problem hiding this comment.
🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win
Critical: invalid except syntax breaks module import.
except ValueError, IndexError: is Python 2 syntax and raises a SyntaxError under Python 3 (parsed as except ValueError as IndexError, which is itself invalid here). This makes the entire markdown_utils module unimportable, which cascades into the vectorizer flow and every test importing it. Wrap the exception types in a tuple.
🐛 Proposed fix
- except ValueError, IndexError:
+ except (ValueError, IndexError):
# Skip files that don't match the site structure
passThe graph context shows the same construct in tapio/utils/__init__.py. Please verify and fix it there too:
#!/bin/bash
# Find the invalid Python 2-style except clause across the repo
rg -nP 'except\s+\w+\s*,\s*\w+\s*:' --type=py🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tapio/utils/markdown_utils.py` at line 71, The Python 2-style exception
clause is invalid under Python 3 and prevents module import; update the affected
except blocks in markdown_utils and the matching clause in utils/__init__.py to
use a proper multi-exception form. Locate the bad syntax in the
exception-handling logic around the markdown utility parsing path and any shared
import-time helpers, then replace the comma-separated exception list with a
tuple so the modules remain importable.
- crawler.py: fix path-traversal check that used string-prefix matching
on resolved paths (abs_full_path.startswith(abs_output_dir)), which a
sibling directory sharing the same prefix could bypass. Use
Path.is_relative_to() instead.
- parser.py: stop replacing every ".html" substring in the output path
with str.replace(); strip only the trailing extension via
Path.with_suffix("").
- document_retrieval_service.py: stop logging raw user query text at
INFO level; log only its length.
- text_utils.py: fail fast with ValueError when chunk_overlap >=
chunk_size, instead of letting it produce a zero/negative chunking
step size downstream.
- ci.yaml: pin third-party actions (checkout, setup-uv, cache) to full
commit SHAs instead of floating version tags, for supply-chain
stability.
- README.md: fix awkward grammar in the all-contributors sentence.
- .vscode/extensions.json: recommend the Ruff extension, matching the
linter this project now standardizes on.
Verified and left unchanged (false positives / not actionable):
- pyproject.toml per-file-ignores "tests/*": empirically confirmed this
already matches nested test files (e.g. tests/config/test_*.py) --
ruff's glob matching for per-file-ignores isn't shell-glob semantics.
- markdown_utils.py "except ValueError, IndexError:": valid Python 3.14
syntax (PEP 758 parenthesis-free multi-exception except), not Python
2-style. Confirmed the module imports and the test suite exercises it
successfully; this project requires Python >=3.14.
- .pre-commit-config.yaml pyupgrade --py313-plus: the pinned pyupgrade
version (3.21.2) has no --py314-plus flag yet; --py313-plus is its
highest supported target.
- test_chroma_store.py float comparison: already fixed in a prior
commit (uses pytest.approx).
|
Verified as false positives / not actionable, left unchanged: tests/* per-file-ignore pattern — empirically confirmed it already matches nested test files; ruff's glob semantics aren't shell-glob |
|
Tip For best results, initiate chat on the files or code changes.
🐇📝 ✏️ Learnings added
|
|



Summary
pre-committoprek(faster, drop-in compatible) as the git hook runner, and run those same checks in CI so they're enforced regardless of whether a contributor has the local hook installed.E/F/I/UPset to ~50 rule categories (bugbear, bandit, pathlib, pydocstyle, simplify, pylint, tryceratops, type-checking, and more) and fix every resulting finding — every ignore (global and per-file) carries an inline comment explaining why.Notable fixes beyond mechanical cleanup
%-formatting; root-logger calls replaced with named loggers;.error()→.exception()in except blocks so tracebacks are actually captured.os.pathcall sites converted topathlib.Pathacross the crawler/parser/vectorizer/config modules.TYPE_CHECKING(TC001-003) after it caused a realNameError— gradio's event-handler wiring eagerly resolves type hints, and this project has nofrom __future__ import annotationsto make that safe. Documented and ignored instead.ChromaStore's reliance on langchain's private_collectionattribute, rewritingquery_with_embedding/get_documentagainst the publicChromaAPI.LLMService/RAGOrchestrator: thehistoryparameter was flagged as unused (ARG002) but was actually wired in from the Gradio app and never consumed. Threaded it into the Ollamamessageslist (capped viaMAX_HISTORY_MESSAGES) and fixed a double-counting bug whererespond_streamincluded the just-appended current message in history.check_model_availability,BaseCrawler._crawl_url, and theparseCLI command via extraction, not threshold-raising.ruff check --fix/ruff formatsteps (which silently passed PRs with violations) withprek run --all-files, the same gate used locally.Test plan
uv run pytest— 179 passeduv run ruff check ./uv run ruff format --check .— cleanuv run mypy tapio— cleanuv run pyrefly check— cleanprek run --all-files— all hooks pass🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Chores